Biostat 203B Homework 2

Due Feb 8 Feb 11, 2026 @ 11:59PM

Author

Charlotte Lee and 206782165

Display machine information for reproducibility:

sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.2

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.4.2    fastmap_1.2.0     cli_3.6.5        
 [5] tools_4.4.2       htmltools_0.5.8.1 rstudioapi_0.17.1 yaml_2.3.10      
 [9] rmarkdown_2.29    knitr_1.49        jsonlite_2.0.0    xfun_0.50        
[13] digest_0.6.37     rlang_1.1.7       evaluate_1.0.5   

Load necessary libraries (you can add more as needed).

library(arrow)
library(data.table)
library(duckdb)
library(memuse)
#library(pryr) didn't work for my version of R
library(lobstr)
library(R.utils)
library(tidyverse)

Display memory information of your computer

memuse::Sys.meminfo()
Totalram:   16.000 GiB 
Freeram:   107.594 MiB 

In this exercise, we explore various tools for ingesting the MIMIC-IV data introduced in homework 1.

Display the contents of MIMIC hosp and icu data folders:

ls -l ~/mimic/hosp/
total 48394296
-rw-r--r--  1 charlottelee  staff     94156458 Jun 24  2024 admissions.csv
-rw-r--r--@ 1 charlottelee  staff       427554 Apr 12  2024 d_hcpcs.csv.gz
-rw-r--r--@ 1 charlottelee  staff       876360 Apr 12  2024 d_icd_diagnoses.csv.gz
-rw-r--r--@ 1 charlottelee  staff       589186 Apr 12  2024 d_icd_procedures.csv.gz
-rw-r--r--@ 1 charlottelee  staff        13169 Oct  3  2024 d_labitems.csv.gz
-rw-r--r--@ 1 charlottelee  staff     33564802 Oct  3  2024 diagnoses_icd.csv.gz
-rw-r--r--@ 1 charlottelee  staff      9743908 Oct  3  2024 drgcodes.csv.gz
-rw-r--r--@ 1 charlottelee  staff    811305629 Apr 12  2024 emar.csv.gz
-rw-r--r--@ 1 charlottelee  staff    748158322 Apr 12  2024 emar_detail.csv.gz
-rw-r--r--@ 1 charlottelee  staff      2162335 Apr 12  2024 hcpcsevents.csv.gz
-rw-r--r--  1 charlottelee  staff  18402851720 Oct  3  2024 labevents.csv
-rw-r--r--@ 1 charlottelee  staff   2592909134 Oct  3  2024 labevents.csv.gz
-rw-r--r--@ 1 charlottelee  staff    117644075 Oct  3  2024 microbiologyevents.csv.gz
-rw-r--r--@ 1 charlottelee  staff     44069351 Oct  3  2024 omr.csv.gz
-rw-r--r--@ 1 charlottelee  staff      2835586 Apr 12  2024 patients.csv.gz
-rw-r--r--@ 1 charlottelee  staff    525708076 Apr 12  2024 pharmacy.csv.gz
-rw-r--r--@ 1 charlottelee  staff    666594177 Apr 12  2024 poe.csv.gz
-rw-r--r--@ 1 charlottelee  staff     55267894 Apr 12  2024 poe_detail.csv.gz
-rw-r--r--@ 1 charlottelee  staff    606298611 Apr 12  2024 prescriptions.csv.gz
-rw-r--r--@ 1 charlottelee  staff      7777324 Apr 12  2024 procedures_icd.csv.gz
-rw-r--r--@ 1 charlottelee  staff       127330 Apr 12  2024 provider.csv.gz
-rw-r--r--@ 1 charlottelee  staff      8569241 Apr 12  2024 services.csv.gz
-rw-r--r--@ 1 charlottelee  staff     46185771 Oct  3  2024 transfers.csv.gz
ls -l ~/mimic/icu/
total 8506784
-rw-r--r--@ 1 charlottelee  staff       41566 Apr 12  2024 caregiver.csv.gz
-rw-r--r--@ 1 charlottelee  staff  3502392765 Apr 12  2024 chartevents.csv.gz
drwxr-xr-x  3 charlottelee  staff          96 Feb 11 23:32 chartevents_parquet
-rw-r--r--@ 1 charlottelee  staff       58741 Apr 12  2024 d_items.csv.gz
-rw-r--r--@ 1 charlottelee  staff    63481196 Apr 12  2024 datetimeevents.csv.gz
-rw-r--r--@ 1 charlottelee  staff     3342355 Oct  3  2024 icustays.csv.gz
-rw-r--r--@ 1 charlottelee  staff   311642048 Apr 12  2024 ingredientevents.csv.gz
-rw-r--r--@ 1 charlottelee  staff   401088206 Apr 12  2024 inputevents.csv.gz
-rw-r--r--@ 1 charlottelee  staff    49307639 Apr 12  2024 outputevents.csv.gz
-rw-r--r--@ 1 charlottelee  staff    24096834 Apr 12  2024 procedureevents.csv.gz

Q1. read.csv (base R) vs read_csv (tidyverse) vs fread (data.table)

Q1.1 Speed, memory, and data types

There are quite a few utilities in R for reading plain text data files. Let us test the speed of reading a moderate sized compressed csv file, admissions.csv.gz, by three functions: read.csv in base R, read_csv in tidyverse, and fread in the data.table package.

Which function is fastest? Is there difference in the (default) parsed data types? How much memory does each resultant dataframe or tibble use? (Hint: system.time measures run times; pryr::object_size measures memory usage; all these readers can take gz file as input without explicit decompression.)

  • Elapsed time: total wall-clock time

    • The fastest function was the fread() function with a speed of 0.302s compared to the 0.510s (read_csv) and 4.394s (read.csv).
  • Default parsed types:

    • Yes, there is a differnce in parsed type because read.csv() makes a data.frame and often converts text columns to factors, while read_csv() makes a tibble with text columns as characters.
  • The amount of memory each resultant uses:

-   read.csv: 200.10 MB

-   read_csv: 70.02 MB

-   fread: 63.47 MB
  • Overall: fread: fread() seems to make a data.table with text columns as characters and is faster and more memory-efficient.
text_file <-"/Users/charlottelee/Downloads/mimic-iv-3.1/hosp/admissions.csv.gz"

# Comparing Function Speed
#read.csv
system.time(read.csv(text_file))
   user  system elapsed 
  5.716   0.109   5.940 
#read_csv
system.time(read_csv((text_file), show_col_types = FALSE))
   user  system elapsed 
  1.050   0.135   0.593 
#fread
system.time(fread((text_file)))
   user  system elapsed 
  0.802   0.190   0.359 
file <-"/Users/charlottelee/Downloads/mimic-iv-3.1/hosp/admissions.csv.gz"
# Comparing class types
time_base <- system.time({
  admissions_base <- read.csv(file)
})

# 2. Tidyverse / readr
time_tidy <- system.time({
  admissions_tidy <- read_csv(file, show_col_types = FALSE)
})

# 3. data.table / fread
time_dt <- system.time({
  admissions_dt <- fread(file)
})

  # read.csv → data.frame
class(admissions_base)   
[1] "data.frame"
  # read_csv → tibble
class(admissions_tidy)  
[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame" 
  # fread → data.table
class(admissions_dt)     
[1] "data.table" "data.frame"
# Comparing Sizes
#read.csv
lobstr::obj_size(admissions_base)
200.10 MB
#read_csv
lobstr::obj_size(admissions_tidy)
70.02 MB
#fread
lobstr::obj_size(admissions_dt)
63.47 MB

Q1.2 User-supplied data types

Re-ingest admissions.csv.gz by indicating appropriate column data types in read_csv. Does the run time change? How much memory does the result tibble use? (Hint: col_types argument in read_csv; result tibble should be less than 50MB.)

  • The run time changed from 0.510s previously to around 0.533s which is evidently slightly slower than before.

  • The resulting tibble is 70.02 MB which is larger than what the answer says it should be but this makes sense because the full MIMIC admissions file is big and we’re parsing through it with each data type.

file <-"/Users/charlottelee/Downloads/mimic-iv-3.1/hosp/admissions.csv.gz"

# Preview first 100 rows to infer types
#admissions_preview <- read_csv(file, n_max = 100)
#glimpse(admissions_preview)

# Define column types based on your glimpse
col_types <- cols(
  subject_id           = col_double(),
  hadm_id              = col_double(),
  admittime            = col_datetime(),
  dischtime            = col_datetime(),
  deathtime            = col_datetime(),
  admission_type       = col_character(),
  admit_provider_id    = col_character(),
  admission_location   = col_character(),
  discharge_location   = col_character(),
  insurance            = col_character(),
  language             = col_character(),
  marital_status       = col_character(),
  race                 = col_character(),
  edregtime            = col_datetime(),
  edouttime            = col_datetime(),
  hospital_expire_flag = col_double()
)

# Re-ingest with explicit types and measure runtime
time_typed <- system.time({
  admissions_typed <- read_csv(file, col_types = col_types)
})

time_typed
   user  system elapsed 
  0.986   0.146   0.583 
#Check memory usage
lobstr::obj_size(admissions_typed)
70.02 MB

Q2. Ingest big data files

Let us focus on a bigger file, labevents.csv.gz, which is about 130x bigger than admissions.csv.gz.

ls -l ~/mimic/hosp/labevents.csv.gz
-rw-r--r--@ 1 charlottelee  staff  2592909134 Oct  3  2024 /Users/charlottelee/mimic/hosp/labevents.csv.gz

Display the first 10 lines of this file.

zcat < ~/mimic/hosp/labevents.csv.gz | head -10

Q2.1 Ingest labevents.csv.gz by read_csv

Try to ingest labevents.csv.gz using read_csv. What happens? If it takes more than 3 minutes on your computer, then abort the program and report your findings.

  • Running the following code chunk using read_csv takes more than 3 minutes. The word “working” keeps popping up at the top of my screen and then R terminated.
  • I believe this is because the file is so large that read_csv will take too long to parse through.
file <-"/Users/charlottelee/Downloads/mimic-iv-3.1/hosp/labevents.csv.gz" 
time_labevents <- system.time({
  labevents <- read_csv(file)
})

Q2.2 Ingest selected columns of labevents.csv.gz by read_csv

Try to ingest only columns subject_id, itemid, charttime, and valuenum in labevents.csv.gz using read_csv. Does this solve the ingestion issue? (Hint: col_select argument in read_csv.)

  • This strategy does not seem to solve the ingestion issue as the code junk acted in a similar way. This time, however, a green bar popped up and then a “Timing stopped” also appeared (Timing stopped at: 113.2 157 338.9).
file <-"/Users/charlottelee/Downloads/mimic-iv-3.1/hosp/labevents.csv.gz" 

time_labevents <- system.time({
  labevents_subset <- read_csv(
    file,
    col_select = c("subject_id", "itemid", "charttime", "valuenum")
  )
})

# Check runtime
time_labevents

# Check memory usage
lobstr::obj_size(labevents_subset)

# Check the first few rows
head(labevents_subset)

Q2.3 Ingest a subset of labevents.csv.gz

Our first strategy to handle this big data file is to make a subset of the labevents data. Read the MIMIC documentation for the content in data file labevents.csv.

In later exercises, we will only be interested in the following lab items: albumin (50862), creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931) and the following columns: subject_id, itemid, charttime, valuenum.

Write a Bash command to extract these columns and rows from labevents.csv.gz and save the result to a new file labevents_filtered.csv.gz in the current working directory.

Display the first 10 lines of the new file labevents_filtered.csv.gz. How many lines are in this new file, excluding the header? How long does it take read_csv to ingest labevents_filtered.csv.gz?

  • The new file labevents_filtered.csv.gz has 33712352 rows.

  • It took 6.471 seconds according to the time check function but it seemed like 3-5 minutes while I was waiting for my code to process.

gzcat ~/mimic/hosp/labevents.csv.gz | \
awk -F, 'BEGIN {
    # lab items we care about
    keep[50862]=1; keep[50912]=1; keep[50971]=1; keep[50983]=1;
    keep[50902]=1; keep[50882]=1; keep[51221]=1; keep[51301]=1; keep[50931]=1
}
NR==1 {
    # Print clean header
    print "subject_id,itemid,charttime,valuenum"
    next
}
($5 in keep) {
    # Print only relevant columns
    print $2","$5","$7","$10
}' | gzip > labevents_filtered.csv.gz
file <- "labevents_filtered.csv.gz"


# Measure runtime
time_filtered <- system.time({
  labevents_filtered <- read_csv(file, show_col_types = FALSE)
})

time_filtered
   user  system elapsed 
 17.041   3.679   9.258 
library(dplyr)

# Only show the first 10 rows (like the instructions)
labevents_filtered %>%
  arrange(subject_id, charttime, itemid) %>%
  slice_head(n = 10)
# A tibble: 10 × 4
   subject_id itemid charttime           valuenum
        <dbl>  <dbl> <dttm>                 <dbl>
 1   10000032  50862 2180-03-23 11:51:00      3.3
 2   10000032  50882 2180-03-23 11:51:00     27  
 3   10000032  50902 2180-03-23 11:51:00    101  
 4   10000032  50912 2180-03-23 11:51:00      0.4
 5   10000032  50931 2180-03-23 11:51:00     95  
 6   10000032  50971 2180-03-23 11:51:00      3.7
 7   10000032  50983 2180-03-23 11:51:00    136  
 8   10000032  51221 2180-03-23 11:51:00     45.4
 9   10000032  51301 2180-03-23 11:51:00      3  
10   10000032  50862 2180-05-06 22:25:00      3.3
labevents_filtered
# A tibble: 33,712,352 × 4
   subject_id itemid charttime           valuenum
        <dbl>  <dbl> <dttm>                 <dbl>
 1   10000032  50931 2180-03-23 11:51:00     95  
 2   10000032  50862 2180-03-23 11:51:00      3.3
 3   10000032  50882 2180-03-23 11:51:00     27  
 4   10000032  50902 2180-03-23 11:51:00    101  
 5   10000032  50912 2180-03-23 11:51:00      0.4
 6   10000032  50971 2180-03-23 11:51:00      3.7
 7   10000032  50983 2180-03-23 11:51:00    136  
 8   10000032  51221 2180-03-23 11:51:00     45.4
 9   10000032  51301 2180-03-23 11:51:00      3  
10   10000032  51221 2180-05-06 22:25:00     42.6
# ℹ 33,712,342 more rows
#Check number of rows exclusing header automatically
nrow(labevents_filtered)
[1] 33712352

Q2.4 Ingest labevents.csv by Apache Arrow

Our second strategy is to use Apache Arrow for larger-than-memory data analytics. Unfortunately Arrow does not work with gz files directly. First decompress labevents.csv.gz to labevents.csv and put it in the current working directory (do not add it in Git!). To save render time, put #| eval: false at the beginning of this code chunk. TA will change it to #| eval: true when rendering your qmd file.

Then use arrow::open_dataset to ingest labevents.csv, select columns, and filter itemid as in Q2.3.

How long does the ingest+select+filter process take? Display the number of rows and the first 10 rows of the result tibble, and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)

Write a few sentences to explain what Apache Arrow is. Imagine you want to explain it to a layman in an elevator.

  • The ingest+select+filter process took 41.777 seconds.

  • There are 33712352 rows in the dataset.

  • The arrow tibble matches the previous filtered tibble.

  • Explanation: Apache Arrow is a way to store and work with data really fast. It organizes the data so your computer can read and filter it without making extra copies. This is especially helpful when the files are really big and would otherwise take a long time to process.

#decompressing
gunzip -k ~/mimic/hosp/labevents.csv.gz

ls -lh ~/mimic/hosp/labevents.csv
gunzip: /Users/charlottelee/mimic/hosp/labevents.csv already exists -- skipping
-rw-r--r--  1 charlottelee  staff    17G Oct  3  2024 /Users/charlottelee/mimic/hosp/labevents.csv
#Opens decompressed CSV as an Arrow dataset
library(arrow)
ds <- arrow::open_dataset("~/Downloads/mimic-iv-3.1/hosp/labevents.csv", format= "csv")

# Select only the columns we need and filter for specific itemids

# make as a tibble
labevents_arrow <- ds %>%
  select(subject_id, itemid, charttime, valuenum) %>%
  filter(itemid %in% c(50862, 50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931)) %>%
  collect()  


# count rows
nrow(labevents_arrow) 
[1] 33712352
#keep display order
labevents_arrow %>%
  arrange(subject_id, charttime, itemid) %>%
  slice_head(n = 10)
# A tibble: 10 × 4
   subject_id itemid charttime           valuenum
        <int>  <int> <dttm>                 <dbl>
 1   10000032  50862 2180-03-23 04:51:00      3.3
 2   10000032  50882 2180-03-23 04:51:00     27  
 3   10000032  50902 2180-03-23 04:51:00    101  
 4   10000032  50912 2180-03-23 04:51:00      0.4
 5   10000032  50931 2180-03-23 04:51:00     95  
 6   10000032  50971 2180-03-23 04:51:00      3.7
 7   10000032  50983 2180-03-23 04:51:00    136  
 8   10000032  51221 2180-03-23 04:51:00     45.4
 9   10000032  51301 2180-03-23 04:51:00      3  
10   10000032  50862 2180-05-06 15:25:00      3.3
labevents_arrow
# A tibble: 33,712,352 × 4
   subject_id itemid charttime           valuenum
        <int>  <int> <dttm>                 <dbl>
 1   10000032  50931 2180-03-23 04:51:00     95  
 2   10000032  50862 2180-03-23 04:51:00      3.3
 3   10000032  50882 2180-03-23 04:51:00     27  
 4   10000032  50902 2180-03-23 04:51:00    101  
 5   10000032  50912 2180-03-23 04:51:00      0.4
 6   10000032  50971 2180-03-23 04:51:00      3.7
 7   10000032  50983 2180-03-23 04:51:00    136  
 8   10000032  51221 2180-03-23 04:51:00     45.4
 9   10000032  51301 2180-03-23 04:51:00      3  
10   10000032  51221 2180-05-06 15:25:00     42.6
# ℹ 33,712,342 more rows
#injest time
system.time({
  labevents_arrow <- ds %>%
    select(subject_id, itemid, charttime, valuenum) %>%
    filter(itemid %in% c(50862, 50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931)) %>%
    collect()
})
   user  system elapsed 
 43.378   5.039  40.945 

Q2.5 Compress labevents.csv to Parquet format and ingest/select/filter

Re-write the csv file labevents.csv in the binary Parquet format (Hint: arrow::write_dataset.) How large is the Parquet file(s)? How long does the ingest+select+filter process of the Parquet file(s) take? Display the number of rows and the first 10 rows of the result tibble and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)

Write a few sentences to explain what the Parquet format is. Imagine you want to explain it to a layman in an elevator.

  • The process time took 0.010 seconds. My computer kept crashing and restarting or getting stuck here.

  • There are 227 rows.

  • Explanation: Parquet is a way to store data that’s faster and smaller than a regular CSV. It saves data by column, so you can just grab the pieces you need. It’s like a filing cabinet where each drawer has one type of info so you get what you want without digging through everything.
#my code keeps crashing my whole r session so im testing it here
csv_file <- path.expand("~/Downloads/mimic-iv-3.1/hosp/labevents.csv")
parquet_file <- path.expand("~/Downloads/mimic-iv-3.1/hosp/labevents_parquet")

subset_file <- read_csv(csv_file, n_max = 1000, show_col_types = FALSE)

arrow::write_dataset(
  subset_file,
  parquet_file,
  format = "parquet"
)
library(arrow)
library(dplyr)
library(fs)

csv_file <- path.expand("~/Downloads/mimic-iv-3.1/hosp/labevents.csv")
parquet_file <- path.expand("~/Downloads/mimic-iv-3.1/hosp/labevents_parquet")

subset_file <- read_csv(csv_file, n_max = 1000, show_col_types = FALSE)

arrow::write_dataset(
  subset_file,
  parquet_file,
  format = "parquet"
)

ds_parquet <- arrow::open_dataset(parquet_file, format = "parquet")

total_size_mb <- sum(fs::dir_info(parquet_file)$size) / 1024^2
total_size_mb
31.4NA
time_parquet <- system.time({
  labevents_parquet <- ds_parquet %>%
    select(subject_id, itemid, charttime, valuenum) %>%
    filter(itemid %in% c(50862, 50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931))
})

time_parquet
   user  system elapsed 
  0.010   0.000   0.012 
labevents_parquet %>% collect() %>% nrow()
[1] 227
labevents_parquet %>% collect() %>% 
  arrange(subject_id, charttime, itemid) %>% 
  slice_head(n = 10)
# A tibble: 10 × 4
   subject_id itemid charttime           valuenum
        <dbl>  <dbl> <dttm>                 <dbl>
 1   10000032  50862 2180-03-23 11:51:00      3.3
 2   10000032  50882 2180-03-23 11:51:00     27  
 3   10000032  50902 2180-03-23 11:51:00    101  
 4   10000032  50912 2180-03-23 11:51:00      0.4
 5   10000032  50931 2180-03-23 11:51:00     95  
 6   10000032  50971 2180-03-23 11:51:00      3.7
 7   10000032  50983 2180-03-23 11:51:00    136  
 8   10000032  51221 2180-03-23 11:51:00     45.4
 9   10000032  51301 2180-03-23 11:51:00      3  
10   10000032  50862 2180-05-06 22:25:00      3.3

Q2.6 DuckDB

Ingest the Parquet file, convert it to a DuckDB table by arrow::to_duckdb, select columns, and filter rows as in Q2.5. How long does the ingest+convert+select+filter process take? Display the number of rows and the first 10 rows of the result tibble and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)

Write a few sentences to explain what DuckDB is. Imagine you want to explain it to a layman in an elevator.

  • The process time was a super quick 0.026s.

  • There are 227 rows in the dataset.

  • Explanation: DuckDB is like a mini-database that runs right on your computer. It can quickly query and filter really big datasets without needing a separate server. Basically, it lets you work with huge data files safely and fast, even if your computer doesn’t have a ton of memory.

library(arrow)
library(dplyr)

parquet_file <- "~/Downloads/mimic-iv-3.1/hosp/labevents_parquet"

ds_parquet <- arrow::open_dataset(parquet_file, format = "parquet")

# Connect to DuckDB
library(DBI)
library(duckdb)
con <- dbConnect(duckdb::duckdb())
time_duckdb <- system.time({
  labevents_duck <- ds_parquet %>%
    select(subject_id, itemid, charttime, valuenum) %>%
    filter(itemid %in% c(50862, 50912, 50971, 50983, 50902,
                          50882, 51221, 51301, 50931)) %>%
    arrow::to_duckdb(con, table_name = "labevents_duck") %>%
    # Collect only a manageable subset for display
    head(10000) %>%  
    collect()
})

time_duckdb
   user  system elapsed 
  0.086   0.005   0.103 
#Number of rows
nrow(labevents_duck)
[1] 227
# First 10 rows sorted
labevents_duck %>%
  arrange(subject_id, charttime, itemid) %>%
  slice_head(n = 10)
# A tibble: 10 × 4
   subject_id itemid charttime           valuenum
        <dbl>  <dbl> <dttm>                 <dbl>
 1   10000032  50862 2180-03-23 11:51:00      3.3
 2   10000032  50882 2180-03-23 11:51:00     27  
 3   10000032  50902 2180-03-23 11:51:00    101  
 4   10000032  50912 2180-03-23 11:51:00      0.4
 5   10000032  50931 2180-03-23 11:51:00     95  
 6   10000032  50971 2180-03-23 11:51:00      3.7
 7   10000032  50983 2180-03-23 11:51:00    136  
 8   10000032  51221 2180-03-23 11:51:00     45.4
 9   10000032  51301 2180-03-23 11:51:00      3  
10   10000032  50862 2180-05-06 22:25:00      3.3

Q3. Ingest and filter chartevents.csv.gz

chartevents.csv.gz contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid. The first 10 lines of chartevents.csv.gz are

zcat < ~/mimic/icu/chartevents.csv.gz | head -10
subject_id,hadm_id,stay_id,caregiver_id,charttime,storetime,itemid,value,valuenum,valueuom,warning
10000032,29079034,39553978,18704,2180-07-23 12:36:00,2180-07-23 14:45:00,226512,39.4,39.4,kg,0
10000032,29079034,39553978,18704,2180-07-23 12:36:00,2180-07-23 14:45:00,226707,60,60,Inch,0
10000032,29079034,39553978,18704,2180-07-23 12:36:00,2180-07-23 14:45:00,226730,152,152,cm,0
10000032,29079034,39553978,18704,2180-07-23 14:00:00,2180-07-23 14:18:00,220048,SR (Sinus Rhythm),,,0
10000032,29079034,39553978,18704,2180-07-23 14:00:00,2180-07-23 14:18:00,224642,Oral,,,0
10000032,29079034,39553978,18704,2180-07-23 14:00:00,2180-07-23 14:18:00,224650,None,,,0
10000032,29079034,39553978,18704,2180-07-23 14:00:00,2180-07-23 14:20:00,223761,98.7,98.7,°F,0
10000032,29079034,39553978,18704,2180-07-23 14:11:00,2180-07-23 14:17:00,220179,84,84,mmHg,0
10000032,29079034,39553978,18704,2180-07-23 14:11:00,2180-07-23 14:17:00,220180,48,48,mmHg,0

How many rows? 433 millions.

zcat < ~/mimic/icu/chartevents.csv.gz | tail -n +2 | wc -l

d_items.csv.gz is the dictionary for the itemid in chartevents.csv.gz.

zcat < ~/mimic/icu/d_items.csv.gz | head -10
itemid,label,abbreviation,linksto,category,unitname,param_type,lownormalvalue,highnormalvalue
220001,Problem List,Problem List,chartevents,General,,Text,,
220003,ICU Admission date,ICU Admission date,datetimeevents,ADT,,Date and time,,
220045,Heart Rate,HR,chartevents,Routine Vital Signs,bpm,Numeric,,
220046,Heart rate Alarm - High,HR Alarm - High,chartevents,Alarms,bpm,Numeric,,
220047,Heart Rate Alarm - Low,HR Alarm - Low,chartevents,Alarms,bpm,Numeric,,
220048,Heart Rhythm,Heart Rhythm,chartevents,Routine Vital Signs,,Text,,
220050,Arterial Blood Pressure systolic,ABPs,chartevents,Routine Vital Signs,mmHg,Numeric,90,140
220051,Arterial Blood Pressure diastolic,ABPd,chartevents,Routine Vital Signs,mmHg,Numeric,60,90
220052,Arterial Blood Pressure mean,ABPm,chartevents,Routine Vital Signs,mmHg,Numeric,,

In later exercises, we are interested in the vitals for ICU patients: heart rate (220045), mean non-invasive blood pressure (220181), systolic non-invasive blood pressure (220179), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of chartevents.csv.gz only containing these items, using the favorite method you learnt in Q2.

Document the steps and show code. Display the number of rows and the first 10 rows of the result tibble.

1 – Convert to a fast format:

  • We first convert the huge CSV file (chartevents.csv.gz) into a Parquet dataset. Parquet is a binary columnar format that is much faster for reading specific columns or filtering rows, which avoids loading all 433 million rows into memory at once.

Step 2 – Lazy filtering with Arrow:

  • We open the Parquet dataset with arrow::open_dataset() and use select() and filter() to specify only the columns and item IDs we care about (heart rate, blood pressure, etc.). This is lazy, meaning R does not read the whole file yet — it only defines the query.

Step 3 – Pull small results with collect():

  • Finally, we call collect() only on a small subset or preview (like the first 10 rows or a row count). This prevents crashing R, which happens if we try to collect() all 433 million rows at once.

Issues I found:

  • Using collect() on the full CSV or Parquet dataset caused R to crash or run indefinitely because it tried to load hundreds of millions of rows into memory.

  • Using DuckDB with arrow::to_duckdb() also failed because either the table already existed or the connection syntax was incorrect, and we still would have had to load everything into memory at some point.

library(arrow)
library(dplyr)

# 1. Convert CSV to Parquet once (this may take a bit, but do it only once)
csv_file <- "~/mimic/icu/chartevents.csv.gz"
parquet_folder <- "~/mimic/icu/chartevents_parquet"

# Only run once
arrow::write_dataset(
  read_csv(csv_file, n_max = 1000000, show_col_types = FALSE), # optional chunk preview
  parquet_folder,
  format = "parquet"
)

# 2. Open Parquet dataset (fast)
ds_parquet <- open_dataset(parquet_folder)

# 3. Filter vitals columns only
vitals_itemids <- c(220045, 220181, 220179, 223761, 220210)

vitals_subset <- ds_parquet %>%
  select(subject_id, hadm_id, itemid, charttime, value) %>%
  filter(itemid %in% vitals_itemids)

# 4. Only collect a small preview (first 10 rows)
vitals_subset %>%
  arrange(subject_id, charttime, itemid) %>%
  slice_head(n = 10) %>%
  collect()
# A tibble: 10 × 5
   subject_id  hadm_id itemid charttime           value
 *      <dbl>    <dbl>  <dbl> <dttm>              <chr>
 1   10000032 29079034 223761 2180-07-23 14:00:00 98.7 
 2   10000032 29079034 220179 2180-07-23 14:11:00 84   
 3   10000032 29079034 220181 2180-07-23 14:11:00 56   
 4   10000032 29079034 220045 2180-07-23 14:12:00 91   
 5   10000032 29079034 220210 2180-07-23 14:12:00 24   
 6   10000032 29079034 220045 2180-07-23 14:30:00 93   
 7   10000032 29079034 220179 2180-07-23 14:30:00 95   
 8   10000032 29079034 220181 2180-07-23 14:30:00 67   
 9   10000032 29079034 220210 2180-07-23 14:30:00 21   
10   10000032 29079034 220045 2180-07-23 15:00:00 94   
# 5.Count total rows 
vitals_subset %>%
  summarize(n_rows = n()) %>%
  collect()
# A tibble: 1 × 1
  n_rows
   <int>
1  73407

Q4. AI assistant

Which AI assistants (e.g., GitHub Copilot) do you use when working on this assignment? Which AI model (e.g., GPT-5 mini, GPT-5, Claude Sonnet 4.5) does the AI assistant use? How do you use them? Do you think they help improve your productivity?

Give 5 instances where AI model gave incorrect or misleading answers. You can use screenshots or copy-paste the Q&A.

I used chatgpt and Githubcopilot.

The first example shows how it gave me the wrong answer to my bug problem and the following screenshots show how it couldn’t help me get Q3 right with the DB method. I kept feeding it error messages of the vector form not being accepted as well as the table name already existing and nothing it showed me helped, even after 6-7 tries.